A very lightweight image super-resolution network

Recently, ConvNeXt and blueprint separable convolution (BSConv), both constructed from standard ConvNet modules, have demonstrated competitive performance in advanced computer vision tasks. This paper proposes an efficient model (BCRN) based on BSConv and the ConvNeXt residual structure for single image super-resolution, which achieves superior performance with a very small number of parameters. Specifically, the residual block (BCB) of the BCRN utilizes the ConvNeXt residual structure and BSConv to significantly reduce the number of parameters. Within the residual block, enhanced spatial attention and contrast-aware channel attention modules are introduced simultaneously to prioritize valuable features within the network. Multiple residual blocks are then stacked to form the backbone network, with dense connections between them to enhance feature utilization. Our model has an extremely low parameter count compared to other state-of-the-art lightweight models, while experimental results on benchmark datasets demonstrate its excellent performance. The code will be available at https://github.com/kptx666/BCRN.


Efficient image super-resolution
Existing SR models tend to introduce a large computational overhead when improving performance, which limits their practical application. Therefore, many works have been proposed to design more efficient SR models. To speed up the model, FSRCNN 8 feeds the original LR image directly into the CNN without first upsampling it with bicubic interpolation as in SRCNN 9, and uses a transposed convolution before the output layer of the network to enlarge the image. A further advantage of this structure is that, for models that must be trained for different upsampling factors, only the final transposed convolution layer needs to be fine-tuned, which greatly reduces the time for training models at different magnifications. Recursive structures use recursively connected convolutional layers or units; their main motivation is to gradually decompose a complex SR task into a set of simpler tasks that are easy to solve. The advantage of recursive networks is that the same convolution can be applied repeatedly while the number of parameters remains unchanged. DRCN 2 introduced recursive structures to the image SR reconstruction task for the first time and further deepened the model using residual skip connections, which enlarges the receptive field of the network while keeping the parameter count fixed, thus improving performance. Ahn et al. 1 proposed CARN-M for mobile devices using a cascaded network architecture, at the cost of a significant decrease in PSNR. Hui et al. 3 proposed the information distillation network IDN, which explicitly splits the intermediate features into two parts along the channel dimension, keeping one part while the other is further processed by subsequent convolutional layers. With this channel splitting strategy, IDN aggregates current information with partially preserved local short-path information and achieves good performance at a moderate size. Later, IMDN 4 further improved on IDN by designing information multi-distillation blocks (IMDB) that extract features at a granular level: the channel splitting strategy is applied several times within an IMDB, each time retaining part of the features and sending the rest to the next step. IMDN performs well in terms of both PSNR and inference time. Kong et al. 10 proposed a new residual local feature network for SR reconstruction, whose main idea is to use three convolutional layers for residual local feature learning to simplify feature aggregation, achieving a good balance between model performance and inference time. The residual feature distillation network RFDN 11 makes the network lighter still by using feature distillation connections (FDCs) on top of the IMDN 4 architecture; in addition, a shallow residual block (SRB), consisting of a convolutional layer, an identity connection and an activation unit, is proposed to build the RFDN backbone and further improve SR performance. Compared with normal convolution, the SRB benefits from residual learning without introducing additional parameters. ClassSR, proposed by Kong et al. 12, can accelerate almost all deep-learning-based methods for super-resolution reconstruction of large images (2K-8K). Its core idea is to use a classification module to divide sub-image patches into different complexity levels (e.g., simple, medium, difficult), each handled by a separate branch with its own network capacity and complexity, thereby significantly reducing the computational effort.

Attention mechanism
The attention mechanism has become an integral part of deep convolutional neural networks. It allows the model to focus on the features most useful for its task during the feature extraction stage, which improves model performance while introducing only a minimal number of additional parameters. Hu et al. 13 proposed a channel attention network, SENet, based on ResNet 14, by designing a low-cost channel attention module and introducing it into each residual block, which significantly improved accuracy on the image classification task. Wang et al. 15 proposed an efficient channel attention (ECA) module by analyzing and improving the channel attention module of SENet. Woo et al. 16 proposed a simple and effective convolutional block attention module (CBAM): for a given intermediate feature map, CBAM infers attention maps sequentially along the channel and spatial dimensions, and then multiplies the attention maps by the input feature map to perform adaptive feature refinement. Since CBAM is a lightweight and general-purpose module, it can be seamlessly integrated into any CNN architecture with negligible overhead. Dai et al. 17 proposed a second-order attention network (SAN) for more powerful feature representation and feature relevance learning; specifically, a novel trainable second-order channel attention (SOCA) module uses second-order feature statistics to adaptively rescale channel features with a more discriminative representation. Wang et al. 18 proposed a non-local module for generating attention maps, inspired by the classical non-local means approach in computer vision. Zhang et al. 19,20 proposed a very deep residual channel attention network, RCAN 19, which enables the network to focus on learning high-frequency information, thus improving the performance of SR models. They then proposed the non-local attention network RNAN 20 for various image restoration tasks by introducing a spatial attention module. Hui et al. 4 proposed the information multi-distillation network IMDN with contrast-aware channel attention to improve the performance of SR models. Liu et al. 21 proposed a residual feature aggregation network for SR reconstruction and enhanced the spatial attention module so that the model focuses more on critical spatial information, further improving SR reconstruction performance.

Method

Network architecture
In this section, the BCRN model is described in detail, and the overall network structure is shown in Fig. 2. It consists of four stages: shallow feature extraction, deep feature extraction, multi-layer feature aggregation and reconstruction.
We denote $I_{LR}$ and $I_{SR}$ as the input and output images and $H_{BCRN}(\cdot)$ as the SR model; the reconstruction process of the BCRN model can be expressed as

$$I_{SR} = H_{BCRN}(I_{LR}).$$

The shallow feature extraction part maps the input image to a higher-dimensional feature space, which can be expressed as

$$F_0 = H_{sf}(I_{LR}),$$

where $H_{sf}(\cdot)$ denotes shallow feature extraction using a BSConv 6 of size 3 × 3.
Deep features are then extracted step by step from the shallow feature $F_0$ by multiple residual blocks (BCB). This process can be expressed as

$$F_k = H_{BCB}^{k}(F_{k-1}), \quad k = 1, \ldots, K,$$

where $H_{BCB}^{k}(\cdot)$ denotes the $k$th residual block BCB. Then, a BSConv 6 of size 3 × 3 is used to aggregate the features; the multi-layer feature aggregation is represented as

$$F_{fused} = H_{fusion}([F_1, \cdots, F_K]),$$

where $H_{fusion}(\cdot)$ denotes feature aggregation, $[F_1, \cdots, F_K]$ denotes the densely connected features extracted by the $K$ residual blocks, and $F_{fused}$ is the feature after aggregation.
Since residual learning is performed using a skip connection, the final reconstruction is represented as

$$I_{SR} = H_{rec}(F_{fused} + F_0),$$

where $H_{rec}(\cdot)$ denotes the reconstruction module, which consists of a BSConv 6 of size 3 × 3 and a sub-pixel convolution operation.
The final model BCRN is optimized with the L1 loss function, which can be expressed as

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| H_{BCRN}(I_{LR}^{i}) - I_{HR}^{i} \right\|_{1},$$

where $\theta$ denotes the learnable parameters of BCRN, $N$ is the number of training pairs, and $I_{HR}^{i}$ is the ground-truth HR image.
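To make the four stages concrete, the pipeline can be sketched in PyTorch (the paper's framework). This is a minimal illustration under the settings given later (six residual blocks, 64 channels); the class names and wiring are our own reading of the equations, not the authors' released code, and `BCB` is a stub that the next subsection fills in.

```python
import torch
import torch.nn as nn

class BSConv(nn.Module):
    """Blueprint separable convolution: 1x1 pointwise conv, then depthwise conv."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.pw = nn.Conv2d(in_ch, out_ch, 1)
        self.dw = nn.Conv2d(out_ch, out_ch, k, padding=k // 2, groups=out_ch)

    def forward(self, x):
        return self.dw(self.pw(x))

class BCB(nn.Module):
    """Placeholder for the residual block detailed below (BSConv + ConvNeXt + ESA/CCA)."""
    def __init__(self, dim):
        super().__init__()
        self.body = BSConv(dim, dim)

    def forward(self, x):
        return self.body(x)

class BCRN(nn.Module):
    def __init__(self, num_blocks=6, dim=64, scale=4):
        super().__init__()
        self.sf = BSConv(3, dim)                      # shallow feature extraction H_sf
        self.blocks = nn.ModuleList([BCB(dim) for _ in range(num_blocks)])
        self.fusion = BSConv(num_blocks * dim, dim)   # multi-layer feature aggregation H_fusion
        self.rec = nn.Sequential(                     # reconstruction H_rec
            BSConv(dim, 3 * scale ** 2),
            nn.PixelShuffle(scale),                   # sub-pixel convolution upsampling
        )

    def forward(self, x):
        f0 = self.sf(x)                               # F_0
        feats, f = [], f0
        for blk in self.blocks:                       # F_k = H_BCB^k(F_{k-1})
            f = blk(f)
            feats.append(f)                           # dense connections: keep every block output
        fused = self.fusion(torch.cat(feats, 1))      # F_fused
        return self.rec(fused + f0)                   # global residual, then upsample

sr = BCRN()(torch.randn(1, 3, 48, 48))                # -> torch.Size([1, 3, 192, 192])
```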

Residual block BCB
As shown in Fig. 3a, the depthwise separable convolution decomposes a standard convolution into a depthwise convolution followed by a pointwise convolution. As shown in Fig. 3b, the blueprint separable convolution decomposes a standard convolution into a pointwise convolution followed by a depthwise convolution; it is the inverse of the depthwise separable convolution. The original paper 6 shows that BSConv performs better with a reduced number of model parameters, so it is used in the residual block BCB. Since the original ConvNeXt 5 residual structure uses a grouped convolution of size 7 × 7, which leads to a large number of network parameters, a grouped convolution of size 3 × 3 is used in the ConvNeXt residual structure of this paper, while the original inverted bottleneck design is retained. The specific ConvNeXt residual structure is shown in Fig. 4 (DConv-3, dim → Conv-1, 4·dim → Conv-1, dim). In this structure, the channel dimension is widened before the activation function, since models with wider features before the activation can significantly improve SR reconstruction performance; this design is therefore used in the residual block BCB.
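The difference between the two factorizations is just the order of the two convolutions; a minimal sketch (our own illustration, not the reference implementation):

```python
import torch.nn as nn

def dsconv(ch, k=3):
    """Depthwise separable convolution (Fig. 3a): depthwise conv, then 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch),  # per-channel spatial filtering
        nn.Conv2d(ch, ch, 1),                             # channel mixing
    )

def bsconv(ch, k=3):
    """Blueprint separable convolution (Fig. 3b): 1x1 pointwise conv, then depthwise conv."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, 1),                             # channel mixing first
        nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch),  # per-channel spatial filtering
    )
```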
As shown in Fig. 5, the residual block BCB consists of a BSConv 6 of size 3 × 3 followed by a ConvNeXt residual structure, which is then connected to the ESA 7 and CCA 4 modules.
For a given input feature $F_{in}$, the residual block BCB is represented as

$$F_{out} = \mathrm{ConvNeXt}(\mathrm{BSConv}(F_{in})),$$

where $\mathrm{BSConv}(\cdot)$ denotes a BSConv 6 of size 3 × 3, $\mathrm{ConvNeXt}(\cdot)$ denotes the ConvNeXt residual structure, and $F_{out}$ denotes the output feature. Next, the feature $F_{out}$ is fed into the ESA 7 and CCA 4 modules to obtain the final output of the residual block BCB.
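A hedged sketch of the BCB, reusing the `bsconv` helper above; the attention modules are stubbed with `nn.Identity` until the next subsection, and the layer widths follow the Fig. 4 annotation:

```python
import torch.nn as nn

class ConvNeXtRes(nn.Module):
    """ConvNeXt-style residual structure (Fig. 4): a 3x3 depthwise (grouped) conv,
    then an inverted bottleneck dim -> 4*dim -> dim with GELU before narrowing."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # DConv-3, dim
            nn.Conv2d(dim, 4 * dim, 1),                     # Conv-1, 4*dim (wider before activation)
            nn.GELU(),
            nn.Conv2d(4 * dim, dim, 1),                     # Conv-1, dim
        )

    def forward(self, x):
        return x + self.body(x)                             # identity shortcut

class BCB(nn.Module):
    """Residual block BCB: F_out = ConvNeXt(BSConv(F_in)), then ESA and CCA."""
    def __init__(self, dim):
        super().__init__()
        self.bsconv = bsconv(dim)        # BSConv of size 3x3, from the sketch above
        self.convnext = ConvNeXtRes(dim)
        self.esa = nn.Identity()         # stand-ins; see the ESA/CCA sketch below
        self.cca = nn.Identity()

    def forward(self, x):
        return self.cca(self.esa(self.convnext(self.bsconv(x))))
```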

ESA and CCA modules
The ESA 7 module is used to improve the ability of the SR reconstruction model to collect various fine features, i.e., to exploit more useful features such as edges, corners and textures for SR reconstruction. Since the effectiveness of the ESA module has been proven, it is introduced into the BCRN model. The structure of the ESA module is shown in Fig. 6. Specifically, to keep the ESA module light enough, it applies a convolution of size 1 × 1 at the beginning to reduce the channel dimension of the input features, and then uses a strided convolution and a max pooling layer to reduce the size of the feature map. After a set of convolutions to extract features, interpolation-based upsampling restores the original feature map size, and a 1 × 1 convolution restores the original channel dimension. Finally, the attention mask is generated by a sigmoid layer. The CCA 4 module is an improved version of the channel attention mechanism, proposed in the IMDN model, which helps enhance details of the reconstructed image such as structure, texture and edge information. Unlike traditional channel attention, which is computed using only the mean value of each channel feature, it applies the sum of the mean and the standard deviation at the beginning to generate contrast information, as shown in Fig. 7; it then reduces the channel dimension of the input features with a 1 × 1 convolution and restores the original dimension with another 1 × 1 convolution. Finally, the attention mask is generated by a sigmoid activation layer. Since the effectiveness of the CCA module has been proven, this module is introduced into the BCRN model.
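Read this way, the two modules might be sketched as below. The reduction ratios and pooling sizes are illustrative assumptions, not values taken from the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESA(nn.Module):
    """Enhanced spatial attention: shrink channels and resolution, extract features,
    upsample back, and gate the input with a sigmoid spatial mask (Fig. 6)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        mid = dim // reduction
        self.reduce = nn.Conv2d(dim, mid, 1)                       # 1x1 channel reduction
        self.down = nn.Conv2d(mid, mid, 3, stride=2)               # strided convolution
        self.body = nn.Conv2d(mid, mid, 3, padding=1)              # feature extraction
        self.expand = nn.Conv2d(mid, dim, 1)                       # 1x1 back to original channels

    def forward(self, x):
        m = F.max_pool2d(self.down(self.reduce(x)), 7, stride=3)   # max pooling
        m = F.interpolate(self.body(m), size=x.shape[2:],          # recover spatial size
                          mode='bilinear', align_corners=False)
        return x * torch.sigmoid(self.expand(m))                   # spatial attention mask

class CCA(nn.Module):
    """Contrast-aware channel attention: per-channel mean + standard deviation
    instead of plain average pooling (Fig. 7)."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),  # 1x1 reduce
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid(),           # 1x1 restore + mask
        )

    def forward(self, x):
        contrast = x.mean(dim=(2, 3), keepdim=True) + x.std(dim=(2, 3), keepdim=True)
        return x * self.fc(contrast)                                     # channel attention mask
```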

Datasets and metrics
The training images consist of 2650 images from Flickr2K 22 and 800 images from DIV2K 23. We use five standard benchmark datasets, Set5 24, Set14 25, B100 26, Urban100 27 and Manga109 28, to evaluate the performance of different approaches. The average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the Y channel are used as the evaluation metrics.
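For reference, Y-channel PSNR is conventionally computed after a BT.601 RGB-to-YCbCr conversion (matching MATLAB's rgb2ycbcr). A minimal sketch, assuming `sr` and `hr` are uint8 RGB arrays of the same shape:

```python
import numpy as np

def rgb_to_y(img):
    """Luma channel in the [16, 235] range used by MATLAB's rgb2ycbcr (BT.601)."""
    r, g, b = [img[..., i].astype(np.float64) for i in range(3)]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    """PSNR between the Y channels of two uint8 RGB images."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

In practice, a border of `scale` pixels is also commonly cropped from each image before the comparison.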

Implementation details of BCRN
The proposed BCRN consists of 6 residual blocks BCB, and the number of channels is set to 64. All depthwise convolutions are of size 3 × 3. The LR images are generated by ×2 and ×4 downsampling of the HR images in MATLAB using bicubic interpolation. For model training, LR image patches of size 48 × 48 are randomly cropped from the LR images as inputs, the batch size is set to 128, and the training data are augmented with random horizontal flips and 90-degree rotations. The model is trained using the Adam 29 optimizer with momentum parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\varepsilon = 10^{-8}$. The initial learning rate is set to $1 \times 10^{-3}$ and is halved every $2 \times 10^5$ iterations. For the final models, the ×2 BCRN is trained from scratch and, after convergence, used as a pre-trained model for the ×4 model. BCRN is implemented in the PyTorch framework and trained on a GeForce RTX 3090 GPU.
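The optimization setup maps directly onto PyTorch; a sketch of one training step, assuming the `BCRN` class from the architecture sketch above (the tensors are random stand-ins for real LR/HR patch batches):

```python
import torch
import torch.nn.functional as F

model = BCRN(scale=2)                                    # x2 model, trained from scratch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,   # halve the learning rate
                                            step_size=200_000, gamma=0.5)

lr_patch = torch.randn(128, 3, 48, 48)                   # batch of 128 random 48x48 LR crops
hr_patch = torch.randn(128, 3, 96, 96)                   # corresponding x2 HR crops

optimizer.zero_grad()
loss = F.l1_loss(model(lr_patch), hr_patch)              # L1 loss
loss.backward()
optimizer.step()
scheduler.step()                                         # stepped per iteration, not per epoch
```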

Ablation study
In this section, we first verify the effectiveness of the two attention modules and the residual block BCB. Then we compare the effects of different activation functions. Finally, we further demonstrate the effectiveness of the proposed BCRN.

Effectiveness of ESA and CCA
To verify the effectiveness of the ESA 7 and CCA 4 modules, relevant ablation experiments are performed. From the data in Table 1, it can be concluded that the BCRN model without the ESA module shows a significant performance degradation alongside an approximately 8% decrease in the number of parameters, while the BCRN model without the CCA module shows a smaller performance degradation with an approximately 1% decrease in the number of parameters. The complete BCRN model shows clear metric improvements on the Set5, Set14, B100, Urban100 and Manga109 datasets. In summary, both the ESA and CCA modules effectively improve the performance of the BCRN model, and both are therefore introduced.

Exploration of different activation functions
BCRN uses the GELU 30 activation function, whereas most previous SR models used ReLU 31 or LeakyReLU 32. We therefore investigated the effects of these three activation functions on the SR model. The results in Table 2 show that, among them, GELU yields a clear performance improvement, so we retain GELU as the activation function in our model.

Analysis of residual block BCB ablation
To better balance model performance against the number of parameters, we further investigate the effect of the number of BCB residual blocks on both. As can be seen from Table 3, both PSNR and SSIM increase with the number of BCB residual blocks. However, when the number of blocks is increased beyond 7, the performance of the proposed network improves only slightly. Therefore, six BCB residual blocks are used to construct the backbone of the BCRN model.

Comparison of model inference time
For the inference time comparison, we report the average of 10 runs on the Urban100 dataset. From Table 4, we can see that BCRN has the fewest Params and Multi-Adds compared to IMDN 4 and LAPAR-A 33. However, the average inference time of the BCRN model is relatively large, because GPU computation is not friendly to depthwise convolution.

Comparison with state-of-the-art methods
We compare the proposed BCRN with state-of-the-art lightweight SR approaches, including SRCNN 9, FSRCNN 8, LapSRN 34, VDSR 35, DRCN 2, SRDenseNet 36, CARN-M 1, CARN 1, DRRN 37, MemNet 38, SRFBN-S 39, SelNet 40, LAPAR-A 33 and SRMDNF 41. Table 5 shows the quantitative comparison results for different upscaling factors. We also report the number of parameters and Multi-Adds calculated on a 1280 × 720 output. Compared with other lightweight SR methods, our BCRN achieves the best performance with only 287-289K parameters and almost the fewest Multi-Adds. The qualitative comparison is shown in Figs. 8 and 9; our approach also obtains the best visual quality among the state-of-the-art methods.

Conclusion
In this paper, we propose BCRN, a lightweight and efficient SR model. Specifically, by using simple and efficient residual blocks and a simple layer connection strategy, our network is made lighter and faster. In addition, we use the effective ESA and CCA attention modules to enhance the model's ability to collect fine-grained information. We then investigate the effect of activation functions on BCRN to find the best choice. Extensive experiments show that the proposed BCRN achieves a good balance between model size, performance and computational cost compared with other lightweight SR models. In the future, we will explore even more efficient SR models to obtain better performance.

Figure 1. Performance and model complexity comparison on the Set5 dataset for upscaling factor ×4.

Figure 8. Visual comparison of BCRN with the state-of-the-art methods on ×4 SR.

Figure 9. Visual comparison of BCRN with the state-of-the-art methods on ×2 SR.
Figure 7. Structure of CCA module.

Table 1. Effectiveness of ESA and CCA.

Table 2. Exploration of different activation functions.

Table 3. Analysis of residual block BCB ablation.

Table 4. Comparison of model inference time.

Table 5. Quantitative comparison with state-of-the-art methods on benchmark datasets. The best and second-best results are in italics and bold, respectively. 'Multi-Adds' is calculated with a 1280 × 720 GT image.